Prediction error estimation: a comparison of resampling methods

Authors

  • Annette M. Molinaro
  • Richard Simon
  • Ruth M. Pfeiffer
Abstract

Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection.

Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods shrink as the number of available specimens increases.

Supplementary information: A complete compilation of results and the R code for simulations and analyses are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
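To make the comparison concrete, the following is a minimal sketch, not the authors' R code, of how a resubstitution estimate and k-fold CV / LOOCV estimates can be computed when feature selection is part of model building. Everything in it is an assumption chosen for illustration: the pure-noise simulated data, the mean-difference feature ranking, the nearest-centroid classifier, and the helper names `select_features`, `fit_predict`, `resubstitution_error` and `kfold_cv_error`. The .632+ bootstrap is not implemented here.

```python
import numpy as np

def select_features(X, y, n_keep):
    """Rank features by absolute difference in class means; keep the top n_keep."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-n_keep:]

def fit_predict(X_train, y_train, X_test, n_keep):
    """Select features on the training data only, then classify by nearest centroid."""
    keep = select_features(X_train, y_train, n_keep)
    c0 = X_train[y_train == 0][:, keep].mean(axis=0)
    c1 = X_train[y_train == 1][:, keep].mean(axis=0)
    d0 = ((X_test[:, keep] - c0) ** 2).sum(axis=1)
    d1 = ((X_test[:, keep] - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def resubstitution_error(X, y, n_keep):
    """Error on the same samples used for selection and fitting."""
    return float(np.mean(fit_predict(X, y, X, n_keep) != y))

def kfold_cv_error(X, y, n_keep, k, rng):
    """k-fold CV error with feature selection repeated inside every fold (k = n gives LOOCV)."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    wrong = 0
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        pred = fit_predict(X[train_mask], y[train_mask], X[test_idx], n_keep)
        wrong += int(np.sum(pred != y[test_idx]))
    return wrong / n

# Hypothetical setting: 40 samples, 1000 noise features, labels unrelated to X.
rng = np.random.default_rng(0)
n, p = 40, 1000
X = rng.standard_normal((n, p))
y = np.repeat([0, 1], n // 2)

resub = resubstitution_error(X, y, n_keep=10)
cv10 = kfold_cv_error(X, y, n_keep=10, k=10, rng=rng)
loocv = kfold_cv_error(X, y, n_keep=10, k=n, rng=rng)
print(f"resubstitution={resub:.3f}  10-fold CV={cv10:.3f}  LOOCV={loocv:.3f}")
```

Because the labels carry no signal in this toy setting, any honest error estimate should sit near chance, so a low resubstitution figure would reflect optimism from reusing the data rather than real predictive ability. The key design choice is that `select_features` is called inside `fit_predict`, so each held-out fold is excluded from feature selection as well as from fitting.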

Similar resources

Semiparametric Bootstrap Prediction Intervals in Time Series

One of the main goals of studying time series is the estimation of prediction intervals based on an observed sample path of the process. In recent years, different semiparametric bootstrap methods have been proposed to find prediction intervals without any assumption about the error distribution. In semiparametric bootstrap methods, a linear process is approximated by an autoregressive process. The...

Ideal bootstrap estimation of expected prediction error for k-nearest neighbor classifiers: Applications for classification and error assessment

Euclidean distance k-nearest neighbor (k-NN) classifiers are simple nonparametric classification rules. Bootstrap methods, widely used for estimating the expected prediction error of classification rules, are motivated by the objective of calculating the ideal bootstrap estimate of expected prediction error. In practice, bootstrap methods use Monte Carlo resampling to estimate the ideal boot...

Nonparametric Error Estimation Methods for Evaluating and Validating Artificial Neural Network Prediction Models

Typically, the true error of an ANN prediction model is estimated by testing the trained network on new data not used in model construction. Four well-studied statistical error estimation methods (cross-validation, group cross-validation, jackknife and bootstrap) are reviewed and presented as competing error estimation methodologies that could be used to evaluate and validate ANN prediction mode...

Resampling Based Empirical Prediction: An Application to Small Area Estimation

Best linear unbiased prediction is well known for its wide range of applications including small area estimation. While the theory is well established for mixed linear models and under normality of the error and mixing distributions, the literature is sparse for nonlinear mixed models under nonnormality of the error or of the mixing distributions. This article develops a resampling based unifie...

Journal:
  • Bioinformatics

Volume 21, Issue 15

Pages: -

Publication date: 2005